(57) ABSTRACT

A content-based classification system is provided that 
detects the presence of object images within a frame and 
determines the path, or trajectory, of each object image 
through multiple frames of a video segment. In a preferred 
embodiment, face objects and text objects are used for 
identifying distinguishing object trajectories. A combination 
of face, text, and other trajectory information is used in a 
preferred embodiment of this invention to classify each 
segment of a video sequence. In one embodiment, a hierar- 
chical information structure is utilized to enhance the clas- 
sification process. At the upper, video, information layer, the 
parameters used for the classification process include, for 
example, the number of object trajectories of each type 
within the segment, an average duration for each object type 
trajectory, and so on. At the lowest, model, information 
layer, the parameters include, for example, the type, color, 
and size of the object image corresponding to each object 
trajectory. In an alternative embodiment, a Hidden Markov 
Model (HMM) technique is used to classify each segment 
into one of a predefined set of classifications, based on the 
observed characterization of the object trajectories con- 
tained within the segment. 
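
The video-layer parameters named above (the number of object trajectories of each type within a segment, and the average duration of each type) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the `Trajectory` record, its field names, and the function `video_layer_parameters` are hypothetical, and durations are assumed to be measured in frames.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One tracked object within a segment (hypothetical record layout)."""
    object_type: str   # e.g. "face" or "text"
    start_frame: int   # first frame in which the object appears
    end_frame: int     # last frame in which the object appears (inclusive)

    @property
    def duration(self) -> int:
        # Number of frames spanned by the trajectory.
        return self.end_frame - self.start_frame + 1


def video_layer_parameters(trajectories):
    """Return {object_type: (trajectory_count, average_duration_in_frames)}."""
    durations = defaultdict(list)
    for trajectory in trajectories:
        durations[trajectory.object_type].append(trajectory.duration)
    return {
        kind: (len(values), sum(values) / len(values))
        for kind, values in durations.items()
    }


# Made-up segment: two face trajectories and one text trajectory.
segment = [
    Trajectory("face", 0, 99),
    Trajectory("face", 20, 59),
    Trajectory("text", 10, 39),
]
params = video_layer_parameters(segment)
print(params)
```

A classifier could combine these per-type counts and averages with model-layer parameters such as type, color, and size; how the layers are weighted is not specified here and is left open in the sketch.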

18 Claims, 4 Drawing Sheets 
PROGRAM CLASSIFICATION USING
OBJECT TRACKING



BACKGROUND OF THE INVENTION 
1. Field of the Invention 

This invention relates to the field of communications and
information processing, and in particular to the field of video
categorization and retrieval.
2. Description of Related Art 

Consumers are being provided an ever increasing supply 
of information and entertainment options. Hundreds of 
television channels are available to consumers, via 
broadcast, cable, and satellite communications systems. 
Because of the increasing supply of viewing options, it is
becoming increasingly difficult for a consumer to
locate programs of specific interest. A number of techniques
have been proposed for easing the selection task, most of 
which are based on a classification of the available 
programs, based on the content of each program. 

A classification of program material may be provided via 
a manually created television guide, or by other means, such 
as an auxiliary signal that is transmitted with the content 
material. Such classification systems, however, are typically 
limited to broadcast systems, and require the availability of 
the auxiliary information, such as the television guide or 
other signaling. Additionally, such classification systems do 
not include detailed information, such as the time or duration 
of commercial messages, news bulletins, and so on. A 
viewer, for example, may wish to "channel surf" during a
commercial break in a program, and automatically return to 
the program when the program resumes. Such a capability 
can be provided with a multi-channel receiver, such as a 
picture-in-picture receiver, but requires an identification of
the start and end of each commercial break. In like manner, 
a viewer may desire the television to remain blank and silent 
except when a news or weather bulletin occurs. Conventional
classification systems do not provide sufficient detail
to support selective viewing of segments of programs. 

Broadcast systems require a coincidence of the program 
broadcast time and the viewer's available viewing time. 
Video recorders, including multiple-channel video 
recorders, are often used to facilitate the viewing of pro- 
grams at times other than their broadcast times. Video 
recorders also allow viewers to select specific portions of
recorded programs for viewing. For example, commercial
segments may be skipped while viewing an entertainment or
news program, or all non-news material may be skipped to
provide a consolidation of the day's news at select times. 
Conventional classification systems are often incompatible 
with a retrieval of the program from a recorded source. The 
conventional television guide, for example, provides infor- 
mation for locating a specific program at a specific time of 
day, but cannot directly provide information for locating a 
specific program on a recorded disk or tape. As noted above, 
the conventional guides and classification systems are also 
unable to locate select segments of programs for viewing. 



BRIEF SUMMARY OF THE INVENTION

It is an object of this invention to provide a method and
system that facilitate an automated classification of content
material within segments, or clips, of a video broadcast or
recording. The classification of each segment within a
broadcast facilitates selective viewing, or non-viewing, of
particular types of content material, and can also be used to
facilitate the classification of a program based on the
classification of multiple segments within the program.

The object of this invention, and others, are achieved by
providing a content-based classification system that detects
the presence of objects within a frame and determines the
path, or trajectory, of each object through multiple frames of
a video segment. In a preferred embodiment, the system
detects the presence of facial images and text images within
a frame and determines the path, or trajectory, of each image
through multiple frames of the video segment. The
combination of face trajectory and text trajectory information is
used in a preferred embodiment of this invention to classify
each segment of a video sequence. To enhance the
classification process, a hierarchical information structure is
utilized. At the upper, video, information layer, the parameters
used for the classification process include, for example, the
number of object trajectories of each object type within the
segment, an average duration for each object type trajectory, and
so on. At the lowest, model, information layer, the
parameters include, for example, the type, color, and size of the
object corresponding to each object trajectory. In an
alternative embodiment, a Hidden Markov Model (HMM)
technique is used to classify each segment into one of a
predefined set of classifications, based on the observed
characterization of the object trajectories contained within the
segment.

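
The HMM-based alternative mentioned above can be sketched as follows: one model per classification scores a sequence of per-frame observation symbols, and the classification whose model reports the highest likelihood is selected. This is a minimal sketch under stated assumptions, not the patented implementation. The function names, the two-state models, the reduced three-symbol alphabet, and every probability value are hypothetical and chosen only for illustration; the scoring uses the standard scaled forward algorithm.

```python
import math


def log_likelihood(observations, initial, transition, emission):
    """Scaled forward algorithm: returns log P(observations | model)."""
    n_states = len(initial)
    # alpha[s] = P(first observation, state s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    log_prob = 0.0
    for obs in observations[1:]:
        # Normalize to avoid underflow, accumulating the log of each scale.
        scale = sum(alpha)
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
        # Propagate through the transition matrix, then weight by emission.
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    return log_prob + math.log(sum(alpha))


def classify_segment(observations, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(observations, *models[label]))


# Hypothetical models: (initial, transition, emission) over symbols 0, 1, 2.
models = {
    "news": (
        [0.8, 0.2],
        [[0.9, 0.1], [0.3, 0.7]],
        [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]],
    ),
    "commercial": (
        [0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5]],
    ),
}

print(classify_segment([0, 0, 1, 0, 0], models))
print(classify_segment([2, 2, 2, 1], models))
```

In practice the model parameters would be obtained by training on sequences of known classification rather than set by hand, and a selector might also require the winning likelihood to exceed a minimum threshold before accepting it.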
BRIEF DESCRIPTION OF THE DRAWINGS

The invention is explained in further detail, and by way of 
example, with reference to the accompanying drawings 
wherein: 

FIG. 1 illustrates an example block diagram of an image 
processor for classifying a sequence of image frames based 
on object trajectories. 

FIG. 2 illustrates a block diagram of an example classifier 
for classifying a sequence of image frames based on Hidden 
Markov Models. 

FIG. 3 illustrates a block diagram of an example face 
tracking system for determining face trajectories in a 
sequence of image frames. 

FIG. 4 illustrates a block diagram of an example text 
tracking system for determining text trajectories in a 
sequence of image frames. 

Throughout the drawings, the same reference numerals 
indicate similar or corresponding features or functions. 

DETAILED DESCRIPTION OF THE 
INVENTION 

FIG. 1 illustrates an example block diagram of an image 
processor 100 for classifying a sequence of image frames 
based on object trajectories. The object that is tracked 
through the sequence of image frames can be any type of 
object that facilitates an identification of the class to which 
the sequence of image frames belongs. For example, figure 
tracking can be used to identify and track the moving figures 
within each sequence of images, to distinguish, for example, 
between a segment of a football game and a segment of a 
cooking show. It has been found that the trajectories of face 
objects and text objects are particularly well suited for 
distinguishing among common television program classes. It 
has also been found, as discussed below, that face objects 
and text objects have significantly different characteristics 
with regard to the partitioning of sequences of image frames 
into classifiable segments. Because face and text trajectories 
are particularly well suited for program classification and 
each requires somewhat different processing, face and text
trajectories are herein used as paradigms for different object
trajectories. As will be evident to one of ordinary skill in the
art, the principles presented herein are applicable to other
object types, such as human figure objects, animal figure
objects, vehicle figure objects, hand (gesture) objects, and so
on.

The example image processor 100 includes a video segmenter
110, a face tracking system 300, a text tracking system 400,
an "other object" tracker 500 and a classifier 200. For ease of
reference and understanding, because face tracking and text
tracking serve as the paradigm for tracking other objects, the
"other object" tracker 500 and corresponding "other"
trajectories 501 are not discussed further herein, their
function and embodiment being evident to one of ordinary
skill in the art in light of the detailed presentation below of
the function and embodiment of the face 300 and text 400
tracking systems, and corresponding face 301 and text 401
trajectories.

The video segmenter 110 in the example processor 100
identifies distinct sequences of a video stream 10 to facilitate
the processing and classification process. The video
segmenter 110 employs one or more commonly available
techniques, such as cut detection, to identify "physical"
segments, or shots, within the stream 10. This physical
segmentation is preliminary. In a soap opera program, for
example, a dialog between two people is often presented as
a series of alternating one-person shots, whereas the
sequence of these shots, and others, between two commercial
breaks forms a "logical" segment of the video stream 10.
Physical segmentation facilitates the processing of a video
stream because logical segments, in general, begin and end
on physical segment boundaries. Note that at various stages
of the processing of the frames of the video stream 10, the
bounds of the segments may vary, and segments may merge
to form a single segment, or split to form individual
segments. For example, until the series of alternating
one-person shots are identified as a dialog segment, they are
referred to as individual segments; in like manner, individual
shots with a common text caption form a common segment
only after the caption is recognized as being common to each
segment. Note also that a segment, or sequence of image
frames, need not be a contiguous sequence of image frames.
For example, for ease of processing or other efficiency, a
sequence of image frames forming a segment or program
segment may exclude those frames classified as commercial,
so that the noncommercial frames can be processed and
classified as a single logical segment.

The face tracking system 300 identifies faces in each
segment of the video stream 10, and tracks each face from
frame to frame in each of the image frames of the segment.
The face tracking system 300 provides a face trajectory
301 for each detected face. The face trajectory 301 includes
such trajectory information as the coordinates of the face
within each frame, the coordinates of the face in an initial
frame and a movement vector that describes the path of the
face through the segment, and/or more abstract information
such as a characterization of the path of the face, such as
"medium distance shot, linear movement", or "close-up
shot, off-center, no movement", and so on. Other trajectory
information, such as the duration of time, or number of
frames, that the face appears within the segment is also
included in the parameters of each face trajectory 301, as
well as characteristics associated with each face, such as
color, size, and so on.

The classifier 200 uses the face trajectories 301 of the
various segments of the video stream 10 to determine the
classification 201 of each segment, or sets of segments 202
of the video stream 10. For example, an anchor person in a
news segment is often presented in a medium distance shot
with insubstantial movement, in contrast to a situation
comedy that may also typically include a medium distance
shot, but usually with significantly more movement than an
anchor person shot. In like manner, a weather newscaster is
often shown in a longer distance shot, with gradual movement
side to side; a commercial segment may also include
a long distance shot with gradual side to side movement, but
the length, or duration, of the commercial segment containing
the long distance shot is typically significantly shorter
than a weather report. In a similar manner, collections of
segments can be grouped to form a single segment for
classification. For example, a trio of segments comprising a
medium distance shot with insubstantial movement followed
by a very long distance shot with somewhat random
face trajectories, followed by a medium distance shot with
multiple face trajectories, can be determined to be an anchor
person introducing a news story, followed by footage of the
news event, followed by a reporter conducting on-the-scene
interviews. Having made this determination, the classifier
200 groups these three segments as a single segment with a
"news" classification 201. Subsequently, having determined
a multitude of such news segments separated by commercial
segments, the classifier 200 classifies the set of these news
segments as a program with a "news" classification 202.

The particular choice of classes, and the relationship
among classes, for the classification process is optional. For
example, a "weather" classification can be defined in some
systems, so as to distinguish weather news from other news;
similarly, "sports-news", "market-news", "political-news",
etc. classifications can be defined. These classifications may
be independent classifications, or they may be subsets of a
news classification in a hierarchical classification system. In
like manner, a matrix classification system may be utilized
wherein a "sports-news" classification is related to a "news"
family as well as a "sports" family of classifications. In
like manner, some classifications may be temporary, or
internal to the classifier 200. For example, the introduction
to programs can often be distinguished from other segments,
and an initial "introduction" classification applied. When
subsequent segments are classified, the classification of the
subsequent segments is applied to the segment having the
interim "introduction" classification. Note also that a
classified segment may include sub-segments of the same or
different classifications. A half-hour block of contiguous
frames may be classified as a "news" program, or segment,
and may contain news, sports-news, commercial, and other
segments; similarly, a sports-news segment may include a
sequence of non-contiguous, non-commercial frames
comprising baseball-news, football-news, and so on.

In a preferred embodiment the classification organization
is chosen to further facilitate the determination of
classifications of segments and sets of segments. For example, a
common half hour news format is national news, followed
by sports, followed by weather news, then followed by local
news, with interspersed commercial segments. If the
classifier 200 detects this general format within a half hour
period of video segments, segments within this period that
had been too ambiguous to classify are reassessed with a
strong bias toward a news or commercial classification, and
a bias against certain other classifications, such as a soap
opera or situation comedy classification.

A variety of conventional techniques, as well as novel
techniques, presented below, can be utilized to effect the
classification process. Expert systems, knowledge based
systems, and the like are particularly well suited to provide
a multivariate analysis for classifying video segments based
on the parameters associated with face trajectories. At a
more analytical level, statistical techniques, such as
multivariate correlation analysis, and graphic techniques, such as
pattern matching, can be used to effect this classification. For
example, a plot of the location of faces in a sequence of
image frames over time demonstrates distinguishable
patterns common to particular classifications. As noted above,
a long sequence of gradual left and right movements of a
face at a distance has a high correlation with a weather
report, while a short sequence of somewhat random
movements has a high correlation with a commercial segment.
The graphic representation of each of these sequences
provides easily distinguishable patterns. The embodiment of
these and other conventional analysis and classification
techniques to the classifier 200 will be evident to one of
ordinary skill in the art in view of this disclosure.

Also illustrated in FIG. 1 is a text tracking system 400. As
in the face tracking system 300, the text tracking system 400
determines the presence of text material in a segment of the
video stream 10, and provides a text trajectory corresponding
to the path of each text element through a sequence of
frames. As distinguished from the face tracking system 300,
the text tracking system 400 is less sensitive to the
segmentation cues provided by the segmenter 110, because text
material often extends across various cuts and shots. For
example, the credits at the end of a program, and character
introductions at the start of a program, are commonly
presented in a foreground while a series of short clips are
presented in the background. The scrolling of text provides
a strong suggestion of a "credits" classification, which
strongly overcomes other classifications of the segments that
occur while the text is being scrolled. As in the face tracking
system 300, the text tracking system 400 provides text
trajectories 401 corresponding to each text element that is
detected and tracked through segments of the video stream
10.

The classifier 200 uses either the face trajectories 301 or
the text trajectories 401 or, preferably, a combination of both
(and other trajectories 501), to provide the classification of
segments of the video stream 10. Note that, as might occur
with scrolling text, the segments containing different text
elements may overlap, and may or may not correspond to the
segments associated with face elements. The classifier 200
applies a variety of techniques to effect the above referenced
reformulation and classification of segments. Heuristic
techniques, including expert systems, knowledge based
systems, and the like, are particularly well suited for such
segmentation reformulation techniques.

As discussed above, the classification of a segment, and
the definition/bounds of the segment, is preferably based on
individual object trajectories, as well as the relationships
among trajectories within a segment, or the relationships
among segments. In one preferred embodiment of this
invention, the classifier 200 utilizes a hierarchical
multivariate classifier. The classifier processes the object
trajectories 301, 401, 501 to form a three level hierarchy
comprising a video level, a trajectory level, and a model
level. At the video level, parameters such as a count of the
number of face, text, and other object type trajectories in
each segment, the count of the number of object trajectories
of each type (face, text, etc.) per unit time, an average
duration of each object type trajectory, an average length of
each segment forming a merged segment, and so on, are
used to facilitate classification. At the trajectory level, [...]
and a characterization of each object trajectory (still, linear
movement, random movement, zoom-in/out, side-to-side,
scrolling, etc.) are used to facilitate classification. At the
model level, parameters such as type, color, size, and location
associated with each object element corresponding to each
object trajectory are used to facilitate classification. Other
hierarchy levels, such as a program level, having parameters
such as the number of particular-segment sequences, may also
be provided to facilitate classification.

In a preferred embodiment, a multi-dimensioned feature
space is defined, wherein the features are selected to provide
separability among the defined classifications. It has been
found that the number of object trajectories for each object
type per unit time and their average duration are fairly
effective separating features because they represent the
"density" of particular objects, such as faces or text, in the
segment of video stream. Additionally, it has been found
that trajectories of long duration usually convey more
important information in video, and a preferred
embodiment utilizes the number of trajectories
with a duration that exceeds a threshold, and their
respective average duration as an effective feature.
Additionally, particular features of particular object
types can be used to further facilitate the classification
process. For example, the number of face trajectories with
shots providing closer than shoulder images is utilized in a
preferred embodiment, because it has been found that
close-up shots are particularly effective for classification.

A conventional "nearest neighbor" parametric classification
approach has been found to be effective and efficient for
program classification. Based on experience, heuristics, and
other factors, the center of the parameter space corresponding
to each feature is determined. A given segment is
characterized using these defined features, the vector
distance to each of the classification centers is determined, and
the segment is classified as the classification having the
closest center. Heuristics are also applied in a preferred
embodiment to verify that the classification determined by
this parametric approach is reasonable, based, for example,
on the surrounding context or other factors.

In an alternative embodiment, Hidden Markov Models
(HMMs) are used to facilitate the classification process. The
Hidden Markov Model approach is particularly well suited
to classification based on trajectories, because trajectories
represent temporal events, and the Hidden Markov Model
inherently incorporates a time-varying model. In a preferred
embodiment of this invention, a set of "symbols" is defined
corresponding to a set of characterizations, or labels, for
characterizing each frame in a segment. In a preferred
embodiment that utilizes face and text objects, the symbols
include:

1. Anchor person with text;
2. One or more people, long shot, with text;
3. Wide close-up (shoulders and above) without text;
4. Close-up (chest and above) without text;
5. Three or more people without text;
6. Two people, long shot, without text;
7. One or more people, medium close (above waist);
8. No face, more than four lines of text;
9. No face, two to four lines of text;
10. No face, one line of text;
11. Black or white screen, little variation;
12. Initial frame of shot;
13. One [illegible];
14. [illegible], no text;
15. Other.

FIG. 2 illustrates a block diagram of an example classifier
200' for classifying a sequence of image frames based on
HMMs. In the example classifier 200', four classification





US 6,754,389 B1

types are illustrated: news, commercial, sitcom, and soap. An
HMM 220a-d is provided for each classification. Using
techniques common in the art, each HMM 220a-d is trained
by providing sample sequences of image frames having a
known classification. Internal to each HMM 220a-d is a
state machine model having a transition probability
distribution matrix and a symbol observation probability
distribution matrix that models the transition between states and
the generation of symbols. The training process adjusts the
parameters of the transition probability distribution matrix
and the symbol observation probability distribution matrix,
and the initial state of the state machine, to maximize the
probability of producing the observed sequences
corresponding to the sample sequences of the known
classification.

After each HMM 220a-d is suitably trained, a new
segment 10' is classified by providing a sequence of
observation symbols corresponding to the new segment 10' to each
HMM 220a-d. The symbol generator 210 generates the
appropriate symbol for each frame of the sequence of frames
forming the segment 10' using, for example, the above list of
symbols. If an image can be characterized by more than one
symbol, the list of symbols is treated as an ordered list, and
the first symbol is selected as the characterizing observation
symbol. For example, if the image contains one person in a
medium close shot with no text (symbol 7, above), and
another person in a close-up shot with no text (symbol 4,
above), the image is characterized as a close-up shot with no
text (symbol 4). In response to the sequence of observation
symbols, each HMM 220a-d provides a probability measure
that relates to the likelihood that this sequence of observed
symbols would have been produced by a video segment
having the designated classification. A classification selector
250 determines the segment classification 201 based on the
reported probabilities from each HMM 220a-d. Generally,
the classification corresponding to the HMM having the
highest probability is assigned to the segment, although
other factors may also be utilized, particularly when the
differences among the highest reported probabilities from the
HMMs 220a-d are not significant, or when the
highest reported probability does not exceed a minimum
threshold.

As will be evident to one of ordinary skill in the art in
view of this disclosure, additional and/or alternative
observation symbol sets, and additional and/or alternative
classification types may be utilized within the example construct
of FIG. 2. If the object types include human figures, for
example, a symbol representing multiple human figure
objects colliding with each other would serve as an effective
symbol for distinguishing segments of certain sports from
other sports or from other classification types. In like
manner, other techniques for classifying segments, and sets
of segments, based on object trajectories may be utilized in
conjunction with, or as an alternative to, the hierarchical
parametric technique and/or the HMM technique presented
herein for the classification of segments of video based on
object trajectories.

Existing and proposed video encoding standards, such as
MPEG-4 and MPEG-7, allow for the explicit identification
of objects within each frame or sequence of frames and their
corresponding movement vectors from frame to frame. The
following describes techniques that can be utilized in
addition to, or in conjunction with, such explicit object tracking
techniques for tracking the paradigm face and text object
types. The application of these techniques, and others, to
identify and track other object types will be evident to one
of ordinary skill in the art in view of this disclosure.

FIG. 3 illustrates an example block diagram of an
example face tracking system 300 for determining face
trajectories in a sequence of image frames. The example face
tracking system 300 of FIG. 3 includes a face detector 320,
a face modeler 350, and a face tracker 360. In a preferred
embodiment of this invention, the face tracking system 300
uses the segmentation of the video stream 10 provided by
the segmenter 110 to facilitate face tracking, because most
facial images begin and end at the physical cut boundaries
typically identified by the segmenter 110. In this example
embodiment, the segmenter 110 provides a start signal to the
face detector 320. In response to this start signal, the face
detector scans an initial frame 11 of the segment to identify
one or more faces in the initial frame 11. In the example
embodiment of FIG. 3, the face detector 320 comprises a
skin tone extractor and smoother 330, and a shape analyzer
340. The skin tone extractor and smoother 330 identifies
portions of the initial image frame 11 containing flesh colors,
and smoothes the individual pixel elements to provide skin
regions to the shape analyzer 340. The shape analyzer 340
processes the identified skin regions to determine whether
each region, or a combination of adjacent regions, forms a
face image.

As indicated in FIG. 3, the face detection process includes
iterations through the extraction 330 and analyzing 340
processes, and is typically a time consuming process. To
minimize the time required to find and identify a face in each
subsequent image frame, the face modeler 350 and face
tracker 360 are configured to use predictive techniques for
determining each face trajectory. After the face detector 320
locates and identifies a face within the initial image frame
11, the face modeler 350 predicts the location of the face in
the next subsequent image frame 12. Initially, lacking other
information, the location of the face in the next subsequent
frame 12 is predicted to be the same as the location of the
face in the initial frame 11. Rather than searching the entire
next image frame for the identified face 321, the face tracker
360 searches only within the vicinity of the predicted
location 351 for this identified face 321. In a preferred
embodiment of this invention, the face tracker 360 utilizes
a significantly simpler, and therefore faster, technique for
determining the presence of a face, compared to the process
used in the face detector 320. Within the vicinity of the
predicted face location, the individual picture elements
(pixels) are classified as "face" or "non-face", based on their
deviation from the characteristics of the identified face 321.
If a sufficient distribution of "face" pixels is detected in the
vicinity of the predicted location, the distribution of face
pixels is declared to be the identified face 321, and the
location of this distribution of face pixels in the subsequent
frame 12 is determined. Note that, because the video
segmenter 110 provides segmentation information, such as the
location of cuts, the likelihood of mistaking a different face
in a subsequent frame 12 for the identified face 321 within
the vicinity of the predicted location 351 is minimal.

When the identified face 321 is located in the next
subsequent frame 12, the face tracker 360 provides feedback
361 to the face modeler 350 to improve the modeler's
predictive accuracy. This feedback 361 may be the
determined location of the identified face 321 in the frame 12, a
differential parameter related to the prior location, and so on.
In a preferred embodiment of this invention, the face
modeler applies appropriate data smoothing techniques, such as
a Kalman filter, to minimize the effects of slight or interim
movements. The face modeler 350 provides the next
predicted location 351, based on the feedback 361 from the face
tracker 360, to the face tracker 360 to facilitate the
identification and location of the face identification 321 in the next
subsequent frame 12, and the process continues.

The face tracker 360 of FIG. 3 is also configured to restart
the face detection process in the face detector 320. This
restart may be effected whenever the face tracker 360 fails
to locate a face within the vicinity of the predicted location
351, or in dependence upon other factors that are correlated
to the appearance of new faces in an image. For example,
MPEG and other digital encodings of video information use
a differential encoding, wherein a subsequent frame is
encoded based on the difference from a prior frame. If a
subsequent frame 12 comprises a large encoding, indicating
significant changes, the face tracker 360 may initiate a
restart of the face detector to locate all faces in this
subsequent frame 12. In response to this restart signal, the face
detector updates the set of face identifications associated
with the current segment to remove identified faces that
were not found in the subsequent frame 12, or to add
identified faces that had not been previously located and

text elements. The character detector 420 identifies the
presence of character-like elements formed by the identified
edges. The text box detector 430 identifies regions of the
image containing somewhat contiguous characters, and the
text line detector 440 identifies combinations of text boxes
forming one or more lines of contiguous text. These
identified lines of text are formulated into text models by the text
modeler 450. The text models include, for example, the
color and size of each text line, and may also include an
identification of the actual characters forming each text line.
In this example system 400, the above processes are
repeated for each frame of the sequence of image frames,
because these edge and character based processes typically
require very little time to complete. The text modeler 450
reports the location of each text line, if any, in each frame to
the text tracker 460. The text tracker 460 formulates the text
trajectory information 401 in dependence upon the
techniques and parameters employed by the classifier 200 of
identified in pnor frames. m FIG. 1, as discussed above. Illustrated in FIG. 4, the text 

By mmunizing the region in which to search for each tracker 460 optionally provides feedback 461 to the text 

identified face 321 111 each subsequent frame 12 and by modeler 450 to facilitate a determinal i on of whelhe r each 

minimizing the complexity of the identification task for each ;rw;fi,*H 1™ *v *k . *r j • 
subsequen? frame 12, the face tracker 360 can provide a 

continuous and efficient determination of the location of « P^^.^nUfied. This feedback 461 ^ particularly use- 
each identified face in each subsequent frame 12. Other ^ fo^aintaming ; a correspondence of text elements that 
optimization techniques may also be applied. For example, are scroU ^ Note thal the Performance of the text 
the search region about the predicted location 351 can be svstem 400 mav be improved by employing some 
dynamically adjusted, based on a confidence factor associ- 0r aU of ^ OP 1 ™ 12 ^ 1011 techniques presented with respect 
ated with the predicted location 351. If, for example, the 30 t0 the face lrackirj S system 300. For example, the elements 
identified face is determined to be stationary for 100 frames, 410-440 can be configured to identify text in initial frames, 
the "prediction" that the face will be located at the same' and tne text trac ker 460 can be configured to process 
location in the 101" frame has a higher confidence factor subsequent frames based on the identified text elements of 
than the initial default prediction that the face will be located P^or frames, using for example conventional pattern or 
at the initial location in the 2 nd frame, and thus the search 35 character matching techniques. Other optimization 
region of the 101*' frame can be smaller than the initial techniques, such as the use of MPEG -provided measures of 
search region of the 2 nd frame. In like manner, if the face is differences between frames and the like, may also be 
moving quickly and somewhat randomly in the sequence of employed. The determination of whether to employ such 
100 frames, the predicted location for the location of the face optimization techniques will depend upon the overhead 
in the 101" frame is less reliable, and a wider search region 40 burden associated with the technique as compared to the 
about the predicted location in the 101" frame would be expected gain in performance provided by the technique, 
warranted. In like manner, the aforementioned MPEG dif- 

ferential coding can be used to eliminate the need to search ^ e foregoing merely illustrates the principles of the 

select subsequent frames 12 when these frames indicate little invention. It will thus be appreciated that those skilled in the 

or no change compared to their prior frames. 45 art will be able to devise-various arrangements which, 

The form and content of the produced face trajectories although not explicitly described or shown herein, embody 

301 from the face tracker 360 will be dependent upon the me principles of the invention and are thus within its spirit 

techniques used by the classifier 200 in FIG. 1, and the and scope. For example, the principles presented herein can 

parameters required by these techniques. For example, using be combined with other image characterization and classi- 

an HMM classifier 200 1 as presented in FIG. 2, the face 50 fication techniques and systems, and other parameters may 

trajectories 301 will preferably contain information related also be used in the classification process. The number of 

to each frame of the segment, but the information can merely optical cuts, such as fade, dissolve, and wipe, or the per- 

be the location of the face relative to a distance from a centage of such cuts within a segment has been found to be 

camera, and may be characterized as "close-up", "medium- very effective for distinguishing news and commercials from 

close , long", and so on. Using a parametric classifier 200, 55 other classifications. In like manner, the applications of the 

the face trajectories 301 may contain a synopsis of the disclosed techniques are not necessarily limited to the 

movement of the face, such as "fixed", "moving laterally", examples provided. For example, as discussed above a 

approaching", "leaving-, and so on, or it may contain the pTOgTim classification, such as a news program, may be 

determined location of the face in each frame of the seg- characterized by a sequence of sub-segments of particular 

ment. The appropriate information to be contained in the 6 0 classification types. The classification of each sub-segment 

face trajectories 301 will be evident to one of ordinary skill can form observation symbols, and Hidden Markov Models 

in the art in light of the selected method employed to effect can be defined that model sequences of observed sub- 

the segment classification. sclent classifications corresponding to each program clas- 

FIG. 4 illustrates an example block diagram of a text sification type. These and other system configuration and 

tracking system 400 for determining text trajectories 401 65 optimization features will be evident to one of ordinary skill 

within a sequence of images. The edge detector and filter in the art in view of this disclosure, and are included within 

410 identifies the presence of edges that are characteristic of the scope of the following claims. 
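As a minimal sketch of the Hidden Markov Model approach described above, the following Python fragment scores a sequence of observed sub-segment classifications against one model per program classification type and selects the most likely type. The symbol names, the two program types, and every probability value are illustrative assumptions rather than values taken from this disclosure; the scoring uses the standard scaled forward algorithm.

```python
import math

def log_forward(obs, start, trans, emit):
    """Log-likelihood of a non-empty symbol sequence under one discrete HMM.

    obs   : list of symbol indices
    start : start[s]    = probability of beginning in state s
    trans : trans[p][s] = probability of moving from state p to state s
    emit  : emit[s][k]  = probability that state s emits symbol k
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik

def classify(obs, models):
    """Return the name of the model that gives the highest log-likelihood."""
    return max(models, key=lambda name: log_forward(obs, *models[name]))

# Hypothetical observation symbols for sub-segment classifications.
ANCHOR, REPORT, OTHER = 0, 1, 2

# Hypothetical models: (start, trans, emit) for each program type.
MODELS = {
    "news": (
        [0.8, 0.2],
        [[0.6, 0.4], [0.5, 0.5]],
        [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    ),
    "other": (
        [0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]],
    ),
}

print(classify([ANCHOR, REPORT, ANCHOR, REPORT, REPORT, ANCHOR], MODELS))
print(classify([OTHER, OTHER, OTHER, OTHER], MODELS))
```

In practice the model parameters would be estimated from labeled training segments; the hand-set values here only serve to make the selection step concrete.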
We claim:

1. A method of classifying content material of a sequence
of image frames, comprising:

identifying at least one object image in an initial image
frame of the sequence of image frames, the at least one
object image including one or more regions of facial
color in the initial image frame and the one or more
regions of facial color are processed to identify a select
region as the at least one object image,

determining at least one object trajectory associated with
the at least one object image based on subsequent
frames of the sequence of image frames, and

maintaining a video level parameter associated with the
sequence of image frames, the video level parameter
including:

an object trajectory count of the one or more object
trajectories,

an average duration of the one or more object
trajectories, and

a frame count of the sequence of image frames; and

classifying the content material of the sequence of image
frames based on the video level parameters.

2. The method of claim 1, wherein 

the at least one object image corresponds to at least one 
of: a face image, a figure image, and a text image. 

3. The method of claim 1, wherein

determining the at least one object trajectory includes

identifying an initial location of the at least one object
image,

identifying a target region of each subsequent frame of
the subsequent frames,

identifying the at least one object image in the target
region and a corresponding subsequent location of
each subsequent frame containing the at least one
object image,

determining the at least one object trajectory based on
the initial location and the subsequent location of the
at least one object image in each subsequent frame.

4. The method of claim 3, wherein

identifying the target region of each subsequent frame is
based on the at least one object trajectory.

5. The method of claim 1, wherein

classifying the sequence includes:

identifying a set of classes within which the sequence
of image frames is classified,

determining a class location within a vector space
corresponding to each class of the set of classes, each
class location corresponding to features of the
corresponding class,

classifying the sequence based on a vector distance
between features of the at least one object trajectory
and each class location corresponding to each class.

6. The method of claim 1, wherein

classifying the sequence includes:

generating a sequence of symbols corresponding to the
at least one object trajectory,

providing the sequence of symbols to a plurality of
markov models, each model of the plurality of
markov models providing a statistic based on the
plurality of markov models, and

classifying the sequence of image frames based on the
statistics provided by the plurality of markov models.

7. The method of claim 1, wherein

identifying the at least one object image corresponding to
a text image includes:

identifying distinct edges in an image frame of the
sequence of image frames,

identifying portions of the image frame that contain
character elements, based on the distinct edges,

identifying the text box based on the portions of the
image frame that contain character elements.

8. A method of classifying a sequence of image frames,
the sequence of image frames comprising one or more object
trajectories, the method comprising:

maintaining a set of parameters associated with the
sequence of image frames, the set of parameters including
at least one of: a video level parameter, a trajectory
level parameter, and a model level parameter,

the video level parameter including at least one of:

an object trajectory count of the one or more object
trajectories,

an average duration of the one or more object
trajectories, and

a frame count of the sequence of image frames;

the trajectory level parameter including at least one of:

an object trajectory duration associated with each
object trajectory of the one or more object
trajectories, and

a characterization of the each object trajectory of the
one or more object trajectories; and

the model level parameter including at least one of:

an object type associated with each object trajectory,

an object color associated with each object trajectory of
the one or more object trajectories,

an object location associated with each object trajectory
of the one or more object trajectories, and

an object size associated with each object trajectory of
the one or more object trajectories; and

classifying the sequence of image frames based on the set
of parameters.

9. The method of claim 8, wherein 

the one or more object trajectories corresponds to at least 
one of: a face trajectory, a figure trajectory, and a text 
trajectory. 

10. The method of claim 8, wherein

the set of parameters further includes a measure associated
with a count of optical cuts within the sequence of
image frames.

11. An image processor for classifying content material of
a sequence of image frames, comprising:

an object identifier that is configured to identify at least
one object image in an initial image frame of the
sequence of image frames, the object identifier including
a skin tone identifier that is configured to identify
one or more regions of facial color in the initial image
frame, and a shape analyzer that is configured to
process the one or more regions of facial color to
identify a select region as the at least one object image,

an object tracker that is configured to provide at least one
object trajectory associated with the at least one object
image based on subsequent frames of the sequence of
image frames, wherein said object tracker maintains a
set of parameters associated with the sequence of image
frames,

the set of parameters including at least one of: a video
level parameter, a trajectory level parameter, and a
model level parameter,

the video level parameter including at least one of:

an object trajectory count of the one or more object
trajectories,

an average duration of the one or more object
trajectories, and

a frame count of the sequence of image frames;

the trajectory level parameter including at least one of:

an object trajectory duration associated with each
object trajectory of the one or more object
trajectories, and

a characterization of the each object trajectory of the
one or more object trajectories; and

the model level parameter including at least one of:

an object type associated with each object trajectory,

an object color associated with each object trajectory of
the one or more object trajectories,

an object location associated with each object trajectory
of the one or more object trajectories, and

an object size associated with each object trajectory of
the one or more object trajectories; and

a classifier that is configured to classify the content
material of the sequence of image frames based on the
set of parameters.

12. The image processor of claim 11, wherein

the at least one object image corresponds to
at least one of: a face image, a figure image, and a text
image.

13. The image processor of claim 11, wherein

the object tracker determines the at least one object
trajectory in an iterative manner, based on an initial
location of the at least one object image in the initial
image frame and one or more subsequent locations of
the at least one object image in one or more subsequent
frames of the sequence of image frames, and

the image processor further includes

an object modeler, operably coupled to the object
tracker, that is configured to

identify the initial location of the at least one object
image, and

identify a target region of each next frame of the one
or more subsequent frames, based on the initial
location and the at least one object trajectory, to
facilitate a determination of a next location of the
one or more subsequent locations within the target
region.

14. The image processor of claim 11, wherein

the classifier is further configured to

maintain a hierarchy of object trajectory information
based on the at least one object trajectory, and

classifies the sequence based on parameters at each
hierarchy of object trajectory information.



15. The image processor of claim 11, wherein

the classifier further includes:

a symbol generator that is configured to generate a
sequence of symbols corresponding to the at least
one object trajectory, and

a plurality of markov models, each model of the
plurality of markov models being configured to
determine a statistic based on the sequence of symbols
corresponding to the at least one object
trajectory, and

wherein

the classifier classifies the sequence of image frames
based on the statistics provided by the plurality of
markov models.

16. The image processor of claim 11, wherein

the object identifier includes

an edge detector that is configured to identify distinct
edges in an image frame of the sequence of image
frames,

a character detector that is configured to process the
distinct edges and to identify therefrom portions of
the image frame that contain character elements, and

a text box detector that is configured to identify a text
box based on the portions of the image frame that
contain character elements, the text box corresponding
to the at least one object image, and

wherein the object tracker is configured to determine the
at least one object trajectory based on one or more
locations of the text box in one or more subsequent
frames of the sequence of image frames.

17. The image processor of claim 16, wherein the object 
identifier further includes 

a text line identifier that is configured to locate contiguous 
text boxes and provide therefrom a text line identifier to 
the object tracker to further facilitate the determination 
of the at least one object trajectory. 

18. The image processor of claim 11, wherein

the classifier classifies the sequence of image frames
based on a vector distance between features of the at
least one object trajectory and a class location
corresponding to features of each identified class within
which the sequence of image frames is classified.
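
The video level parameters of claim 1 and the vector-distance classification of claims 5 and 18 can be illustrated with a minimal Python sketch. The function names, the normalization by frame count, the two class names, and the class locations are all hypothetical choices made for illustration; the claims do not prescribe any of them.

```python
import math

def video_level_parameters(trajectories, frame_count):
    """Summarize the object trajectories of one segment.

    trajectories: list of (object_type, first_frame, last_frame),
    with both frame indices inclusive.
    """
    durations = [last - first + 1 for _type, first, last in trajectories]
    average = sum(durations) / len(durations) if durations else 0.0
    return {
        "trajectory_count": len(trajectories),
        "average_duration": average,
        "frame_count": frame_count,
    }

def feature_vector(params):
    # Assumed normalization so that segments of different lengths are
    # comparable; the two features are on different scales, which a real
    # system would likely need to weight or standardize.
    frames = params["frame_count"]
    return (
        100.0 * params["trajectory_count"] / frames,  # trajectories per 100 frames
        params["average_duration"] / frames,          # fraction of segment covered
    )

def classify(features, class_locations):
    """Return the class whose location is nearest in Euclidean distance."""
    return min(
        class_locations,
        key=lambda name: math.dist(features, class_locations[name]),
    )

# Hypothetical class locations in the two-dimensional feature space.
CLASS_LOCATIONS = {
    "news": (1.0, 0.8),         # few, long-lived trajectories
    "commercial": (6.0, 0.15),  # many, short-lived trajectories
}

news_like = video_level_parameters(
    [("face", 0, 269), ("text", 10, 289)], frame_count=300
)
ad_like = video_level_parameters(
    [("text", 20 * i, 20 * i + 19) for i in range(10)], frame_count=200
)

print(classify(feature_vector(news_like), CLASS_LOCATIONS))  # news
print(classify(feature_vector(ad_like), CLASS_LOCATIONS))    # commercial
```

Class locations would normally be derived from training segments of known classification, for example as the mean feature vector of each class.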